Sequence similarity between SARS-CoV-2 nucleocapsid and multiple sclerosis-associated proteins provides insight into viral neuropathogenesis following infection

The novel coronavirus SARS-CoV-2 continues to cause death and disease throughout the world, underscoring the necessity of understanding the virus and host immune response. From the start of the pandemic, a prominent pattern of central nervous system (CNS) pathologies, including demyelination, has emerged, suggesting an underlying mechanism of viral mimicry to CNS proteins. We hypothesized that immunodominant epitopes of SARS-CoV-2 share homology with proteins associated with multiple sclerosis (MS). Using PEPMatch, a newly developed bioinformatics package which predicts peptide similarity within specific amino acid mismatching parameters consistent with published MHC binding capacity, we discovered that nucleocapsid protein shares significant overlap with 22 MS-associated proteins, including myelin proteolipid protein (PLP). Further computational evaluation demonstrated that this overlap may have critical implications for T cell responses in MS patients and is likely unique to SARS-CoV-2 among the major human coronaviruses. Our findings substantiate the hypothesis of viral molecular mimicry in the pathogenesis of MS and warrant further experimental exploration.


Results
SARS-CoV-2 nucleocapsid exhibits significant homology with MS-associated proteins across both 9mer and 15mer peptide groups.
We used PEPMatch 17 to determine the sequence homology between immunodominant proteins from SARS-CoV-2 and MS-associated proteins, which were compiled using the Immune Epitope Database and Analysis Resource (IEDB) (http://www.iedb.org/) (Table 1). We tested both 9mer and 15mer segments to contextualize MHC I and MHC II presentation, respectively. These peptide lengths correspond to experimentally validated epitope lengths associated with CD8+ and CD4+ T cell recognition, respectively 17. Intriguingly, nucleocapsid protein shared significant overlap with MS-associated proteins in both 9mer and 15mer groups (Fig. 1A) (see Methods for full details on background controls and statistical comparisons). Spike, membrane, NS7a, and envelope proteins from SARS-CoV-2 did not share significant overlap with the list of MS proteins above their sequence-shuffled controls, and replicase polyprotein 1ab had significantly elevated peptide matches above its sequence-shuffled control only in the 15mer group (Fig. 1B-F). Out of the 108 antigens on the list of MS-associated proteins used in this analysis, 22 proteins had sequences matching nucleocapsid across both 9mer and 15mer groups. Of note, the canonical MS-associated protein PLP shared homology with nucleocapsid from SARS-CoV-2 (Fig. 1G).

PEPMatch-predicted PLP epitope shares homology with experimentally validated MS-associated peptides and is in a region known to elicit T-cell responses in MS patients.
To contextualize the above findings, we investigated whether epitopes returned from PEPMatch had been documented in the literature. We found that of all the proteins returned from the PEPMatch analysis, PLP had the highest number of BLASTp-verified homologous sequences that have been experimentally validated and curated on the IEDB ( Fig. 2A,B). PLP has been documented to contain epitopes recognized by T cells from MS patients across numerous studies 18,19 . In one study, T cell lines derived from MS patients and activated with PLP showed the strongest reactivity against regions 40-60, 95-117, 117-150, and 185-206 out of all 9 regions of PLP tested 20 . The authors concluded from this study that these regions were largely responsible for eliciting strong T cell responses in MS patients. We found that the PLP peptides returned from PEPMatch fell within one of these immunodominant regions (Fig. 2C). Overall, these results provide a computational basis for the potential of SARS-CoV-2 to initiate T-cell-driven molecular mimicry through specific MS-associated proteins, including PLP.

MHC-binding prediction reveals other proteins of interest which may facilitate molecular mimicry beyond PLP.
Given that molecular mimicry is likely facilitated not only by sequence similarity but also by HLA haplotype 6,21 , we next investigated whether MS-associated alleles [22][23][24] would be predicted to bind to the nucleocapsid peptides returned from the PEPMatch analysis. We compared the binding propensity of MS-associated alleles to a set of alleles which represent approximately 99% of the population worldwide to contextualize the analysis 25 . We analyzed the top 50th percentile of binding predictions returned from the algorithm in order to focus on physiologically relevant binding predictions while retaining maximum information on the alleles 26 . We found that while some MS-associated alleles demonstrated high average binding capacities (represented by low average percentile rank, Fig. 3A), including HLA-DRB1*03:01, HLA-DRB1*04:04, and HLA-DRB1*08:01, there was no significant difference in average binding predictions from MS-associated alleles in comparison to the reference set of alleles (Fig. 3B). We next asked whether certain proteins matching the nucleocapsid peptides from the original PEPMatch analysis were enriched amongst the top peptide:allele binding predictions. Inspection of these peptide:allele combinations that scored a 10th percentile rank or less across all alleles revealed unique proteins enriched for predicted MHC binding (Fig. 3C). Although our focus has largely centered on the canonical MS protein PLP, this analysis highlights other potential proteins returned from the PEPMatch analysis that may trigger autoimmunity across a wide variety of HLA-haplotyped individuals, including CD99 (Fig. 3C).
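The filtering and tallying steps described above can be illustrated with a minimal sketch. It is not the authors' code: the record layout, the function names, the use of the median as the cutoff for the "top 50th percentile", and the toy peptide and protein labels are all assumptions made for illustration. The ranks shown are invented and are not results from this study.

```python
from collections import Counter, defaultdict
from statistics import mean, median
from typing import Dict, List, NamedTuple


class Prediction(NamedTuple):
    """One peptide:allele binding prediction (hypothetical record layout)."""
    peptide: str
    allele: str
    rank: float    # percentile rank; a lower rank means stronger predicted binding
    protein: str   # MS-associated protein that the peptide matched


def top_half(predictions: List[Prediction]) -> List[Prediction]:
    """Keep the better-binding half, assumed here to be ranks at or below the median."""
    cutoff = median([p.rank for p in predictions])
    return [p for p in predictions if p.rank <= cutoff]


def mean_rank_by_allele(predictions: List[Prediction]) -> Dict[str, float]:
    """Average percentile rank of the retained predictions, grouped by allele."""
    grouped = defaultdict(list)
    for p in predictions:
        grouped[p.allele].append(p.rank)
    return {allele: mean(ranks) for allele, ranks in grouped.items()}


def protein_tally(predictions: List[Prediction], max_rank: float = 10.0) -> Counter:
    """Count how often each matched protein appears among predictions of rank <= max_rank."""
    return Counter(p.protein for p in predictions if p.rank <= max_rank)


# Invented toy data: labels and ranks are placeholders, not study output.
TOY = [
    Prediction("PEP_A", "HLA-DRB1*03:01", 4.0, "protein_X"),
    Prediction("PEP_B", "HLA-DRB1*03:01", 30.0, "protein_Y"),
    Prediction("PEP_A", "HLA-DRB1*04:04", 8.0, "protein_X"),
    Prediction("PEP_B", "HLA-DRB1*04:04", 60.0, "protein_Y"),
    Prediction("PEP_C", "HLA-DRB1*04:04", 90.0, "protein_Y"),
]

kept = top_half(TOY)
print(mean_rank_by_allele(kept))
print(protein_tally(kept))
```

In this toy set, the protein whose peptides rank at or below 10 for both alleles is the one that dominates the tally, which mirrors the kind of enrichment summarized in Fig. 3C.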
Seasonal coronaviruses share significant homology with MS-associated proteins but do not overlap with PLP.
To determine whether the above findings were unique to SARS-CoV-2, we tested whether nucleocapsid, spike, or membrane proteins from seasonal coronaviruses 229E, NL63, OC43, and HKU1 shared significant homology with MS-associated proteins. We found that the nucleocapsid protein from 3 out of the 4 seasonal coronaviruses tested shared significant homology with MS-associated proteins, but only in the 15mer groups (Fig. 4A). Because the 9mer group has stricter matching parameters, this suggests a higher degree of peptide similarity between SARS-CoV-2 nucleocapsid and MS-associated proteins (Fig. 1A). Nucleocapsid proteins of these 3 seasonal coronaviruses also share the highest percent identity to SARS-CoV-2 nucleocapsid as determined by BLASTp (Fig. 4B). Of note, while the coronaviruses all had some shared PEPMatch protein "hits", PLP significantly overlapped only with the nucleocapsid of SARS-CoV-2 (Fig. 4C).

Discussion
Since the beginning of the pandemic, SARS-CoV-2 has been associated with CNS sequelae with manifestations ranging from memory loss and attention deficits to demyelination 3,4,30,31. Although viral molecular mimicry has been a long-standing hypothesis regarding the triggering of initial and recurrent episodes of MS demyelination 7, no bioinformatic approaches had been created which consider physiological parameters necessary for fully understanding MHC presentation capacity. In this analysis, we show computationally that SARS-CoV-2 may be associated with the development of MS, using a new tool developed by the IEDB team that includes more robust physiological parameters in its assessment for homology. SARS-CoV-2 nucleocapsid, spike, and membrane proteins have been widely demonstrated to elicit strong immune responses 32. Indeed, it was recently found that these 3 proteins were among the 9 viral proteins making up 83% of CD4+ T cell responses, and among the 8 accounting for 81% of CD8+ T cell responses in COVID-convalescent patients 33,34. In addition, NS7a 35, replicase polyprotein 1ab 36, and envelope 37 proteins have all been implicated in driving immune responses in individuals recovering from SARS-CoV-2. We asked whether any of these proteins shared significant homology with MS-associated neuro-antigens, an observation which would fortify the proposed case for molecular mimicry in the development of MS following SARS-CoV-2 infection. Interestingly, both the 9mer and 15mer groups (representing MHC I and II, respectively) from nucleocapsid showed significant sequence overlap with MS-associated proteins (Fig. 1A), in contrast to most of the other proteins tested in our analysis with the exception of replicase polyprotein 1ab, which showed significance only in the 15mer group (Fig. 1E). Given the strict parameters for amino acid homology for the 9mer group (at least 7 of 9 residues, approximately 78%, had to match exactly), a strong overlap of nucleocapsid and MS-associated proteins highlights the potential for SARS-CoV-2-mediated molecular mimicry across both classes of MHC.
Figure 1. Nucleocapsid protein shares significant homology with MS-associated proteins. (A) PEPMatch was used to determine the overlap between nucleocapsid protein from SARS-CoV-2 and MS-associated proteins, which were determined using an IEDB query. For comparison, 30 iterations of shuffled-sequence nucleocapsid protein were run through the analysis with the same parameters as the intact nucleocapsid sequence, and the average of these runs was used as the background for comparison. Fisher's exact or Chi-square tests were run on both the 9mer and the 15mer peptides, with up to 2 mismatches for the 9mer peptides and up to 7 mismatches for the 15mer peptides. (B-F) The same analysis was run as in (A) using the specified proteins labeled above each graph from SARS-CoV-2. (G) Shown are the proteins whose peptides significantly overlapped with nucleocapsid across both the 9mer and 15mer groups.
Myelin proteolipid protein (PLP) has been implicated in the development of MS across a multitude of studies 20,[38][39][40]. In humans, the development of MS following Rubella virus infection was demonstrated to be linked to the high relative similarity score of E2 protein to PLP 16,41, demonstrating the potential for sequence overlap leading to the induction of demyelinating disease. More recently, França et al. discovered that the NS5 epitope of Zika virus shared 83% sequence homology with PLP, implicating NS5 as a likely candidate for driving the development of MS and possibly other CNS inflammatory demyelinating disorders 42. In our own study using PEPMatch, we found that an epitope from PLP shared significant homology with a nucleocapsid peptide from SARS-CoV-2 (Fig. 1G).
Importantly, the PLP epitope that overlaps with nucleocapsid is associated with a high number of experimentally validated epitopes curated from the literature (Fig. 2A,B), providing evidence for the potential for SARS-CoV-2-driven molecular mimicry. Among the numerous studies that have investigated the autoantigenicity of specific epitopes of PLP in the context of MS development 18,19, certain epitopes have been specifically associated with eliciting strong T cell responses in MS patients 20. The epitope returned from PEPMatch in our study, 120-134, is encompassed within an epitope associated with strong T cell responses in DR15*01-positive MS subjects 20 (Fig. 2C). Collectively, these data substantiate the hypothesis that SARS-CoV-2 nucleocapsid may provide the basis for molecular mimicry preceding the development of MS in susceptible individuals.
In the words of Wekerle, "Molecular mimicry thus goes well beyond the simple structural resemblance of two individual peptides. It also embraces the peptide-presenting MHC…" 7. We investigated whether nucleocapsid peptides returned from PEPMatch would be predicted to bind preferentially to MS-associated class II HLA alleles, of which more has been published in relation to MS susceptibility than for class I HLA. Using the top 50th percentile of binding predictions, the average percentile rank of all peptide:allele combinations grouped by allele demonstrated no significant pattern of binding enrichment among the MS-associated alleles (Fig. 3B). Several salient considerations must be acknowledged when interpreting these data, however. Importantly, weakly binding peptide epitopes from myelin basic protein (MBP) have been associated with eliciting strong autoimmune responses in EAE mouse models 43, suggesting that predicting the binding propensity of peptide:allele combinations may be more complex than fully accounted for by this machine learning algorithm. This analysis is also limited in scope by the total number of HLA alleles assessed. We analyzed 31 alleles, whereas there are over 33,000 allele and haplotype entries on the IPD/IMGT-HLA database 44. However, our results suggest that certain HLA alleles not previously associated with MS development may increase susceptibility to CNS-related sequelae, including HLA-DRB4*01:01, though this speculation warrants further investigation.
Top-ranking MS-associated alleles from our analysis, including HLA-DRB1*03:01, have been associated not only with the development of MS but also with the development of other autoimmune disorders. HLA-DRB1*03:01 has been associated with autoimmune hepatitis 45, autoimmune encephalitis 46, neuromyelitis optica 47, and autoimmune Addison's disease (AAD) 48. AAD has been reported in the literature following acute COVID-19 infection [49][50][51], suggesting that this allele may predispose individuals of this haplotype to other acute autoimmune-related manifestations of SARS-CoV-2 beyond CNS pathologies. In conclusion, numerous studies have been published which link specific HLA haplotypes with susceptibility to severe COVID outcomes [52][53][54]; however, no studies exist to date which explore the haplotypes of individuals with rare CNS sequelae following SARS-CoV-2 infection. Future experimental investigation should broaden the scope of disease manifestations to better understand the potential link between HLA haplotype and SARS-CoV-2 neuropathogenesis.
To further explore the relationship between allele binding predictions and CNS manifestations, we gathered the peptides of percentile rank 10 or lower 26 and tallied the original proteins from which the peptides originated. We found a considerable enrichment of CD99 peptides among the top allele binding predictions (Fig. 3C). CD99 is a cell surface protein expressed by a wide range of tissues and organ systems, including lymphocytes, and is critical for cellular adhesion, migration, and diapedesis 55. Recently, CD99 was implicated in exacerbated COVID-19-associated kidney injury 56. Elution of CD99 peptides in the urine separated groups of patients with mild versus severe kidney pathology; in addition, CD99+ lymphocytes were found in significantly lower percentages in patients with severe outcomes. The authors speculated that aberrant autoimmune responses directed against CD99 may be promoting the exacerbation of kidney injury, and that this reduction of CD99 overall could indicate a loss of endothelial integrity 56. Recently, Domizio and colleagues found that acute respiratory injury following SARS-CoV-2 infection is in part due to lung endothelial damage, which they speculated translated to other organ systems as well 57. Neuropathogenesis following SARS-CoV-2 infection has been noted for distinct pathophysiological patterns, including loss of blood-brain barrier (BBB) integrity 58, which is structurally maintained largely through endothelial cells 59. Integral to these observations is the finding that Keratin Type II Cytoskeleton 6A (a protein recovered in our original PEPMatch analysis, Fig. 1G) was also found to be dysregulated in patients with severe kidney injury 56. This keratin protein has been associated with wound healing; in mice, loss of this protein led to a profound inability to undergo normal wound healing 60. The peptides identified from PEPMatch (Fig. 1G), and more specifically those predicted to be bound and presented on MHC (Fig. 3C), warrant further experimental investigation, as they collectively suggest a mechanism by which molecular mimicry initiated by SARS-CoV-2 infection could lead to the inappropriate targeting of proteins involved in a range of biological processes necessary for a multitude of organ systems, culminating in severe neurological and other pathological outcomes.
Molecular mimicry has been explored as a mechanism for triggering MS for decades, and several viral and bacterial pathogens have been associated with MS development 61 62,63 . In our own hands using PEPMatch, we were able to demonstrate that a larger proportion of EBV proteins share similarity with MS-associated proteins than its virologic cousin, cytomegalovirus (CMV), aligning with current views on MS etiology and substantiating the practicality and utility of PEPMatch as a new resource (Supplemental Fig. 1). Seasonal coronaviruses have also been implicated in the pathogenesis of MS using a range of bioinformatic and clinical approaches [27][28][29] . Primary T cell clones isolated from MS patients activated with HCoV-229E and HCoV-OC43 proteins cross-reacted with myelin basic protein (MBP) and PLP, highlighting the propensity for viral molecular mimicry involving seasonal coronaviruses 29 . In our study, we found that nucleocapsid protein from 3 out of the 4 major seasonal coronaviruses showed significant sequence overlap with MS-associated proteins (Fig. 4A); however, this effect was only found in the 15mer group, whose threshold for peptide sequence overlap is around 50% 17 ; this indicates a greater percentage of exact sequence matching and overall homology of SARS-CoV-2 nucleocapsid to MS proteins. In addition, we found that no myelin proteins shared homology with any coronaviruses except for PLP and SARS-CoV-2. Discussion has arisen querying whether the SARS-CoV-2 pandemic will herald an increase in MS incidence 64 . Our results provide a computational basis for this hypothesis, which should be further investigated using epidemiological approaches.
Here, we utilized a new homology-based package called PEPMatch to determine the sequence overlap between immunodominant proteins from SARS-CoV-2 and proteins associated with MS. We found that nucleocapsid significantly overlapped with MS-associated proteins, including PLP. Our work suggests that a variety of proteins may be involved in triggering autoimmunity associated with MS pathogenesis in certain individuals. We chose to focus our analysis on understanding T-cell-driven molecular mimicry, though the creation and maintenance of autoantibodies has been strongly implicated in both MS and COVID-19 severity 38,65 . Recent reports have shown that nucleocapsid is critical in driving humoral immunity in both SARS-CoV 66 and SARS-CoV-2 67 . Future integrated experimental and computational efforts should focus on understanding the full breadth of autoimmunity following SARS-CoV-2 infection, including the involvement of other organ systems and both adaptive immune arms, for a more comprehensive understanding of pathological sequelae of SARS-CoV-2.

Methods
Protein compilation. MS-associated antigens were compiled using the Immune Epitope Database and Analysis Resource (http://www.iedb.org/). The search included the following parameters: "Organism: Homo Sapiens", "Include Positive Assays", "No B Cell Assays", "Disease Data: Multiple Sclerosis (DOID:2377)", and "MHC Restriction Type: Class I" (class I restriction was added to ensure that many proteins associated with both class I and class II were included in the analysis, as the vast majority of the 1200+ proteins discovered without filtering were associated only with class II). Finally, proteins were added which have been associated strongly with MS in the literature but did not originally appear in the IEDB filtering 68. This list was trimmed down to proteins which had updated UniProt IDs, leading to the list of 108 MS-associated antigens used in this analysis. This list, with UniProt IDs, can be found in Table 1. The complete list of query proteins in this analysis, including proteins from SARS-CoV-2 and seasonal coronaviruses, as well as EBV and CMV proteins, can also be found in Table 1.
Homology assessment. Homology between SARS-CoV-2 immunodominant proteins and the list of MS-associated proteins was assessed primarily using PEPMatch and BLASTp. All code utilized in this analysis can be found on GitHub using the following link: https://github.com/mad-scientist-in-training/PEPMatch_SARS-CoV-2_MS.
PEPMatch. PEPMatch is a homology-based algorithm developed by Daniel Marrama 17 and is freely available to use on GitHub: https://github.com/IEDB/PEPMatch. PEPMatch was utilized to preprocess the list of MS-associated proteins and query proteins (found in Table 1). Custom Python scripts were created to utilize PEPMatch and perform all other processing necessary to run the package (see Homology assessment for the link to the GitHub page). Parameters were set to preprocess all data sets separately into 9mer peptides and 15mer peptides, allowing up to 2 and 7 mismatches, respectively. PEPMatch output for all significant tests is available in Supplemental Table 1.
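To make the matching parameters concrete, the following is a minimal sketch of k-mer comparison with a mismatch limit. It does not use or reproduce the PEPMatch package, which relies on preprocessing for efficient search; it is a naive pairwise comparison written for illustration only. All function names and the toy sequences are hypothetical.

```python
from typing import Iterator, List, Tuple


def kmers(sequence: str, k: int) -> Iterator[str]:
    """Yield every overlapping window of length k from a protein sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length peptides differ."""
    if len(a) != len(b):
        raise ValueError("peptides must have equal length")
    return sum(x != y for x, y in zip(a, b))


def mismatch_hits(query: str, target: str, k: int,
                  max_mismatches: int) -> List[Tuple[str, str, int]]:
    """Return (query k-mer, target k-mer, mismatches) for every pair within the limit."""
    target_kmers = list(kmers(target, k))
    hits = []
    for q in kmers(query, k):
        for t in target_kmers:
            distance = hamming(q, t)
            if distance <= max_mismatches:
                hits.append((q, t, distance))
    return hits


def min_identity(k: int, max_mismatches: int) -> float:
    """Smallest fraction of identical residues that still counts as a match."""
    return (k - max_mismatches) / k


# Toy sequences, not real proteins.
QUERY = "ACDEFGHIKL"
TARGET = "ACDEFGHIKV"

print(mismatch_hits(QUERY, TARGET, k=9, max_mismatches=2))
print(round(min_identity(9, 2), 3), round(min_identity(15, 7), 3))
```

With the parameters stated above, a 9mer match requires at least 7 of 9 identical residues (about 78%), while a 15mer match requires at least 8 of 15 (about 53%), which is why the 9mer comparison is the stricter of the two.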

BLASTp. To determine homology between PEPMatch-predicted peptides overlapping with nucleocapsid and experimentally validated epitopes (Fig. 2A), a list of epitopes for each protein was compiled using the IEDB. The search parameters included "Antigen: Protein_of_interest (Homo sapiens (human))", "Include Positive Assays", and "Host: Homo sapiens (human)". Standard Protein BLAST (BLASTp) was used to determine the homology between all PEPMatch-predicted epitopes and experimentally validated peptides for each protein. To normalize, the total number of homology "hits" returned from BLASTp was divided by the number of peptides predicted from PEPMatch for each protein.
BLASTp was also used to determine the percent identity between nucleocapsid from SARS-CoV-2 and seasonal coronaviruses using the standard, recommended parameters.
Statistics. For statistical comparisons, each query protein (spike, nucleocapsid, membrane, NS7a, replicase polyprotein 1ab, and envelope) was separately processed to provide a "background" number of hits against MS-associated proteins. Specifically, custom Python scripts (found on the GitHub page) were created to shuffle the amino acid sequence of each query protein, and the shuffled sequences were run to determine the number of matches with MS-associated proteins as background matching signal. Duplicate matches (for example, if identical peptides from 2 separate MS-associated proteins matched a SARS-CoV-2-derived peptide) were removed prior to statistical testing. For each query protein tested, 30 iterations of unique shuffled comparisons were run, which were then averaged and used in a Fisher's exact or Chi-square test for statistical comparison. All analyses were run conservatively with two-tailed parameters; however, if the number of random matches exceeded the number of matches of the intact peptide, the p value was assumed to be 1.
Allele assessment. The MHC binding prediction machine learning algorithm from the IEDB was used to determine whether MS-associated alleles were predicted to preferentially bind and present nucleocapsid peptides in comparison to a reference set of alleles. MS-associated alleles were curated from the literature [22][23][24] and used in comparison to an allele set from the IEDB that covers 99% of the population 25. Nucleocapsid peptide "hits" from PEPMatch were used in the query against the total set of alleles (MS and reference) using the IEDB-recommended 2.22 standard parameter.
As most binding predictions were skewed heavily towards high percentile ranks, we focused on the top 50th percentile of all matching predictions to capture information for all input alleles while also providing physiologically relevant binding information. In Fig. 3C, the nucleocapsid peptide-MHC binding predictions of percentile rank 10 or less 26 were cross-referenced with the original PEPMatch output to determine the homologous MS-associated proteins, and the frequency of peptide matches was tabulated and graphed.
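The shuffled-sequence background comparison described under Statistics can be sketched as follows. This is an illustrative reconstruction, not the authors' script: the layout of the 2x2 contingency table, the rounding of the averaged background to an integer, the counting of distinct query k-mers as a stand-in for duplicate removal, and all function names are assumptions. Only a Fisher's exact test is shown, implemented directly so that the sketch has no third-party dependencies.

```python
import random
from math import comb
from statistics import mean
from typing import List, Tuple


def shuffled(sequence: str, rng: random.Random) -> str:
    """Return the same residues in a random order (composition is preserved)."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def count_matching_kmers(query: str, targets: List[str], k: int,
                         max_mismatches: int) -> int:
    """Count distinct query k-mers within max_mismatches of any target k-mer."""
    target_kmers = {t[i:i + k] for t in targets for i in range(len(t) - k + 1)}
    matched = set()
    for i in range(len(query) - k + 1):
        q = query[i:i + k]
        if any(sum(a != b for a, b in zip(q, t)) <= max_mismatches
               for t in target_kmers):
            matched.add(q)
    return len(matched)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    total = sum(prob(x) for x in range(lowest, highest + 1)
                if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, total)


def compare_to_background(query: str, targets: List[str], k: int,
                          max_mismatches: int, iterations: int = 30,
                          seed: int = 0) -> Tuple[int, int, float]:
    """Return (intact matches, rounded mean shuffled matches, p value)."""
    rng = random.Random(seed)
    n_kmers = len(query) - k + 1
    intact = count_matching_kmers(query, targets, k, max_mismatches)
    background = round(mean([
        count_matching_kmers(shuffled(query, rng), targets, k, max_mismatches)
        for _ in range(iterations)
    ]))
    if background > intact:
        return intact, background, 1.0  # rule stated in the Methods
    # Assumed table: matched vs unmatched k-mers, intact vs shuffled sequence.
    p = fisher_exact_two_sided(intact, n_kmers - intact,
                               background, n_kmers - background)
    return intact, background, p


# Toy sequences, not real proteins.
TOY_QUERY = "MKTAYIAKQRQISFVKSHFSRQ"
TOY_TARGETS = ["AYIAKQRQI"]
print(compare_to_background(TOY_QUERY, TOY_TARGETS, k=9, max_mismatches=2))
```

Shuffling preserves amino acid composition while destroying residue order, so the background estimates how many matches arise from composition alone. A fixed seed is used here only to make the sketch reproducible.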

Data availability
The authors declare that all data in support of the main findings of this study are available within the paper and its supplementary information files. All other data (including raw data generated in the supplemental findings) are available upon reasonable request to the corresponding author.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.